1.0 Introduction


The design of fault-tolerant avionics and control systems 
needs to be supported by an assessment of whether the systems possess 
the level of reliability for which they were designed. Because 
ultra-high reliability requirements exist for such systems, an 
experimental approach based on lifetesting techniques cannot be used 
to evaluate them [1,2]. Analytical models based on stochastic
assumptions must then be developed to help predict and validate the 
reliability of these systems. 

Early approaches to reliability prediction were based on a 
combinatorial method first discussed by Mathur and Avizienis [3].
Their method assumed that the system was a series of subsystems, each 
of which was to be modeled as a hybrid NMR type. The reconfiguration 
mechanism was assumed to be perfect. Bouricius and his colleagues 
extended this model to allow the reconfiguration mechanism to have an 
imperfect coverage [4]. As an embodiment of this notion, the CARE
program was developed at JPL as a computer-aided reliability
evaluation package. This was later modified by Raytheon and was named
CARE II [5].

Not all systems of interest can be broken down into a series 
of smaller subsystems. In such cases, combinatorial methods have been 
superseded by more general Markov chain methods. Ng and Avizienis [6] 
have developed a unified model for the reliability evaluation of 
nonmaintained (closed) fault-tolerant systems based on a Markov 
approach. These ideas have been incorporated into a computer-based 
reliability evaluation package known as ARIES [7].



Several limitations of the early approaches became evident 
with their use in modeling ultra-reliable, fault-tolerant systems such 
as SIFT [8] and FTMP [9]. First, fault coverage was assumed to be a
single number, whereas in practice, the times to detect, isolate, and 
recover from a fault are nonzero random variables. Furthermore, these 
quantities do depend on the current state of the system. The 
implication is that the fault-handling behavior of the system needs to 
be modeled and one or more parameters need to be derived capturing the 
coverage aspects. Such a coverage model is already a part of CARE II 
and continues to be an integral part of CARE III [10].

The second limitation was the assumption that fault-occurrence 
and fault-handling behavior are simultaneously accounted for by a 
single Markov model of system behavior. This implies a combinatorial 
explosion in the state space of the Markov chain, resulting in 
computation difficulties. It may be recognized, however, that the 
time constants of the fault-handling processes are several orders of 
magnitude smaller than those of the fault-occurrence events. It is 
therefore possible to analyze separately the fault-handling behavior 
of the system (the coverage model) and later incorporate the results 
of the coverage model, together with the fault-occurrence behavior, in 
an overall reliability model. This is the approach used in CARE III. 

The third limitation was the assumption that all random 
variables of interest are exponentially distributed. In practice, 
this is seldom the case. One possible approach to the problem of 
non-exponential holding times is to use the method of stages [11].
Indeed, this approach has been used in other reliability models [12]
and in queueing theoretic models for computer performance
evaluation [13]. However, the use of the method of stages increases
the size of the state space. CARE III is a major departure from
conventional approaches in that it purports to support non-exponential
distributions, while avoiding the problem of large state spaces
through the use of state aggregation. More specifically, CARE III
uses a combination of semi-Markov techniques (while analyzing the
coverage model) and time-dependent transition parameters resulting in
a non-homogeneous Markov chain (at the aggregate model level).

In Section 2, we give some background regarding the ideas to
be pursued in the remaining part of the paper. In Section 3, we
present several simple examples illustrating important features of the
CARE III model. In Section 4, we treat the CARE III model in a more
general fashion with more detailed examples. In Section 5, several
approximation techniques are discussed.

2. Background

A common approach to solving large problems is to partition 
the problem into smaller parts, and then combine the solutions of the 
parts into a solution for the entire problem. This approach to 
problem solving is known as divide-and-conquer and is considered to be
very effective in designing algorithms [14]. The same approach is
often found to be effective in solving large system analysis problems. 
In this connection we refer to the first step of dividing the original 
problem into smaller parts as decomposition and the step of combining 
solutions of parts into the solution for the whole as aggregation.
Aggregation and decomposition are simply the complementary
activities of combining and separating parts of the system to
facilitate analysis [15]. The decomposition/aggregation approach to
system analysis will be effective if (i) interactions within a part
can be studied as if interactions between parts did not exist and
(ii) interactions between parts can be analyzed without referring to
the interactions within parts [16].

If we can assume that subsystem failure/recovery processes are 
independent of each other then a decomposition into subsystems, 
separate analysis of subsystems, and aggregation to obtain the final 
solution can be used. In CARE, for example, solution to subsystem 
reliability is obtained using the hybrid-NMR expression and the 
subsystem reliabilities are multiplied (aggregation step) to obtain 
system reliability. 
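
The aggregation step just described can be sketched in a few lines. This is only an illustration of multiplying independent subsystem reliabilities for a series system; the function name and the numerical values are hypothetical, and the subsystem values are not produced by the hybrid-NMR expression, which is not reproduced here.

```python
from math import prod


def series_system_reliability(subsystem_reliabilities):
    """Multiply independent subsystem reliabilities (series aggregation step)."""
    values = list(subsystem_reliabilities)
    for r in values:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must lie in [0, 1]")
    return prod(values)


# Hypothetical subsystem reliabilities at some mission time (illustrative only).
example_subsystems = [0.999, 0.9995, 0.9999]
system_reliability = series_system_reliability(example_subsystems)
```

The product is valid only under the independence assumption stated above; the following paragraphs explain why that assumption is often too strong.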

Unfortunately the assumption of independent behavior of
subsystems is often unrealistic. Nevertheless, if the coupling
between subsystems is weak, we may consider the system
nearly-decomposable [16] and the solution obtained by aggregation will
then be an approximation to the desired solution. Indeed this approach
is considered effective in queueing theoretic models of system
performance analysis.

In the reliability context, however, there is an alternative 
approach to the above structural decomposition. This new approach may 
be called behavioral decomposition. We observe that the
fault-occurrence behavior of a system is composed of relatively
infrequent events while fault-handling behavior of a system is
composed of relatively frequent events. It may, therefore, be
desirable to separately analyze the fault-handling behavior and
reflect its effect in an aggregate model by one or more parameters.
This is indeed the approach used in CARE II and CARE III. We must
remember that the solution thus obtained will, in general, be an
approximation to the desired solution. The behavioral approach to
decomposition of complex reliability models will be explored further
in the next two sections.

Next we should consider the problem of modeling non-exponential
holding time distributions. By definition of a
homogeneous Markov chain, the random variable denoting time spent in a 
state must have the memoryless property, that is, must be 
exponentially distributed. This implies a serious assumption about 
the behavior of various fault-occurrence and recovery processes, an 
assumption that is often violated. 

In general, removing this restriction on holding times in the 
states of a Markov chain yields a semi-Markov process, with the 
corresponding difficulty in solving such models. At present, it 
appears that the use of general semi-Markov processes may have to be 
restricted to relatively small problems. Indeed, the coverage 
(fault-handling) model used in CARE II and CARE III uses the general 
semi-Markov approach. 

The use of a general semi-Markov process implies that besides 
the state information, we must also have the time spent in the given 
state in order to predict the future behavior of the process. Thus, 
the effective state space is uncountably infinite. For most practical
problems, however, a lot less information usually suffices. Besides
the state of the original stochastic process, a few bits of additional
information regarding the time spent in the state is usually enough to
predict the future. In other words, we still have a Markov chain with
a finite (in general, countably infinite) state space, albeit a larger
one than the original state space.


As a simple example of the problems involved in removing the
exponential holding time assumption, consider a component with a
constant failure rate $\lambda$ (hence exponentially distributed time to
failure). The Markov state diagram of the component is shown in
Figure 1(a). This is a very simple model to solve for the state
probabilities and hence the component reliability.

Figure 1(a) - Example Markov Chains.

Figure 1(b)

Now, suppose the assumption of exponentially distributed time to
failure is unsatisfactory. Further suppose that the time to failure is
a 2-stage Erlang random variable with parameter $2\lambda$ (hence the mean
time to failure is the same as before, that is, $1/\lambda$). We can then
model the behavior of the component using the three-state Markov chain
as shown in Figure 1(b). In the state (0,A), the component is in
failure free state and in the first stage of its lifetime
distribution. Since each
stage of an Erlang random variable is exponentially distributed, each 
state of the resulting new state diagram possesses an exponentially 
distributed holding time. Furthermore, in general, given any holding 
time distribution, it is possible to derive an exponential stage type 
decomposition of that distribution to a specified degree of 
approximation [11] . The problem with this approach is that the more a 
given holding time differs from the exponential, the larger the number 
of stages needed to approximate it and the larger the state space of 
the resulting Markov chain. 
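
As a rough illustration of the method of stages, the sketch below treats the 2-stage Erlang lifetime (stage rate $2\lambda$) in two ways: through its closed-form survival function, and by numerically integrating the three-state chain of exponential stages. The function names, the step count, and the use of a classical Runge-Kutta integrator are assumptions made for this example only.

```python
import math


def erlang2_reliability(lam, t):
    """Closed-form survival function of a 2-stage Erlang lifetime, stage rate 2*lam."""
    stage_rate = 2.0 * lam
    return math.exp(-stage_rate * t) * (1.0 + stage_rate * t)


def staged_chain_reliability(lam, t, steps=2000):
    """Integrate the three-state chain (stage A, stage B, failed) with classical RK4."""
    stage_rate = 2.0 * lam
    h = t / steps

    def deriv(p_a, p_b):
        # Stage A empties at 2*lam; stage B is fed by A and empties at 2*lam.
        return -stage_rate * p_a, stage_rate * (p_a - p_b)

    p_a, p_b = 1.0, 0.0  # start failure-free, in the first stage
    for _ in range(steps):
        k1 = deriv(p_a, p_b)
        k2 = deriv(p_a + 0.5 * h * k1[0], p_b + 0.5 * h * k1[1])
        k3 = deriv(p_a + 0.5 * h * k2[0], p_b + 0.5 * h * k2[1])
        k4 = deriv(p_a + h * k3[0], p_b + h * k3[1])
        p_a += h * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        p_b += h * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
    return p_a + p_b  # probability that the component has not yet failed
```

Each extra stage adds one state per component, which is the state-space growth referred to above.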

Yet another approach to the problem of non-exponential holding 
times is to consider a Markov chain whose transition parameters are 
allowed to be time-dependent. The resulting Markov chain is said to 
be a non-homogeneous Markov chain. Thus, for example, the homogeneous 
Markov chain of Figure 1(a) is transformed into the non-homogeneous 
chain shown in Figure 1(c). It can be shown that the holding time 
distribution in state 0 is now given by 

$$F_0(t) = 1 - e^{-\int_0^t \lambda(\tau)\,d\tau}.$$

Note that if we let $\lambda(t) = a t^{b}$ for constants $a$ and $b$, we have a
Weibull holding time distribution.
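
A minimal numerical sketch of this relation follows. It evaluates the holding-time distribution for an arbitrary time-dependent hazard by Simpson's rule; the helper names, the step count, and the power-law parameters `a` and `b` are illustrative assumptions, not quantities taken from the CARE III program.

```python
import math


def holding_time_cdf(hazard, t, steps=1000):
    """Return 1 - exp(-integral_0^t hazard(tau) dtau) using composite Simpson's rule."""
    if t <= 0.0:
        return 0.0
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = t / steps
    total = hazard(0.0) + hazard(t)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * hazard(k * h)
    return 1.0 - math.exp(-total * h / 3.0)


def power_law_hazard(a, b):
    """Hypothetical power-law hazard a * t**b, which yields a Weibull holding time."""
    return lambda tau: a * tau ** b
```

With a constant hazard the same function reproduces the exponential distribution, so the homogeneous chain is recovered as a special case.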
Although solving for the state probabilities of a non-homogeneous
Markov chain is somewhat more complex than solving for those of a
homogeneous Markov chain, the advantage is that the state space is
not expanded. Another disadvantage is that the transition
rates are allowed to depend only on global time defined from the 
beginning of the process operation. In order to model an arbitrary 
holding time distribution in a given state i of the chain, we would 
like the transition parameter leading from state i to state j to be a 
function of local time, measured from the time of entry into state i. 
Since the (global) time of entry into state i is a random variable 
(unless state i is the start state), a simple shift of time origin is
not adequate to transform local time based quantities into global time 
based quantities. 

When we consider reliability models of systems without
renewals (or repairs), the time to failure of a component can be
measured in global time, and hence the failure rate leading out of
state i can be labelled in terms of the global time. It is in this
fashion that CARE III models non-exponential time-to-failure
distributions at the aggregate model level.

Another apparent difficulty with this approach is met when we
allow spare failure rates to be different from the failure rate of an
active unit. Consider a 2-component standby redundant system with the
active unit failure rate of $\lambda(t)$ and the passive unit failure rate
$\alpha\lambda(t)$. The non-homogeneous Markov chain of Figure 1(d) is a model of
this system assuming perfect coverage and further assuming that the
failure rate of the spare unit once activated is only a function of
its total age. This assumption is graphically depicted in Figure 1(e).

Figure 1(d) [arc label: $(1+\alpha)\lambda(t)$]

Figure 1(e) [labels: Active, Time]

Non-homogeneous Markov Chains

3. Motivating Examples


We shall now present the essential features of the CARE III 
model through a series of motivating examples. 

Example 1: Consider a standby redundant system in which the failure
rates of the spare and that of the active unit are both
constant and equal to $\lambda$. Upon the occurrence of a failure,
a recovery process takes control and with probability c
succeeds in recovering from failure. The Markov state
diagram of the system is shown in Figure 2.

Figure 2. - Standby Redundant System Markov Chain

The standard approach to solve for the state
probabilities of such a process is to set up Kolmogorov
differential equations and (in the Markov case) use Laplace
transforms to solve them. We will instead use convolution
equations [17, pp. 483-488] or the method of sample
paths [18]. The reason for the use of this method is the
easy generalization to non-Markovian (in particular,
semi-Markov) processes which we will need shortly.



The integral equation for the transition probability
$p_{ik}(t)$, the probability that the Markov chain is in state k
at time t given that it started in state i at time 0, is
given by

$$p_{ik}(t) = \delta_{ik}\, e^{-\lambda_i t} + \int_0^t \sum_{j} p_{ij}(x)\, \lambda_{jk}\, e^{-\lambda_k (t-x)}\, dx \tag{1}$$

where $\delta_{ik}$ is the Kronecker $\delta$ function (that is,
$\delta_{ii} = 1$ and $\delta_{ik} = 0$ for $i \neq k$), $\lambda_{ij}$ is the transition rate from
state i to state j and $\lambda_i = \sum_j \lambda_{ij}$.

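Applying Equation (1) to the current problem and
remembering that the start state is 0 so that
$p_{00}(0) = 1$ and $p_{0j}(0) = 0$ for $j \geq 1$, we have the equations for the
state probabilities $P_k(t) = p_{0k}(t)$ as follows:

$$P_0(t) = e^{-2\lambda t},$$

$$P_1(t) = \int_0^t P_0(x)\, 2\lambda c\, e^{-\lambda (t-x)}\, dx
= 2\lambda c\, e^{-\lambda t} \int_0^t e^{-\lambda x}\, dx
= 2c\, [e^{-\lambda t} - e^{-2\lambda t}], \text{ and}$$

$$\begin{aligned}
P_2(t) &= \int_0^t P_0(x)\, 2\lambda (1-c)\, dx + \int_0^t P_1(x)\, \lambda\, dx \\
&= \int_0^t e^{-2\lambda x}\, 2\lambda (1-c)\, dx + \int_0^t 2c\, [e^{-\lambda x} - e^{-2\lambda x}]\, \lambda\, dx \\
&= (1-c)\, [1 - e^{-2\lambda t}] + 2c\, [(1 - e^{-\lambda t}) - \tfrac{1}{2}(1 - e^{-2\lambda t})] \\
&= 1 - 2c\, e^{-\lambda t} - (1-2c)\, e^{-2\lambda t}.
\end{aligned}$$

The system reliability

$$R(t) = \sum_{k \in L} P_k(t) = P_0(t) + P_1(t) = 2c\, e^{-\lambda t} + (1-2c)\, e^{-2\lambda t} \tag{2}$$

where L = {0,1} is the set of system states where the system is
functioning properly.

#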
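
The closed-form result of Example 1 can be cross-checked numerically. The sketch below integrates the Kolmogorov forward equations of the three-state chain (rates $2\lambda c$ and $2\lambda(1-c)$ out of state 0, rate $\lambda$ from state 1 to state 2) with a classical Runge-Kutta scheme and compares $P_0(t) + P_1(t)$ with Expression (2). The function names, integrator, and step count are assumptions chosen for illustration.

```python
import math


def example1_reliability(lam, c, t):
    """Expression (2): R(t) = 2c e^{-lam t} + (1 - 2c) e^{-2 lam t}."""
    return 2.0 * c * math.exp(-lam * t) + (1.0 - 2.0 * c) * math.exp(-2.0 * lam * t)


def example1_state_probabilities(lam, c, t, steps=2000):
    """RK4 integration of the three-state chain of Example 1; returns (P0, P1, P2)."""

    def deriv(p):
        p0, p1, _ = p
        return (
            -2.0 * lam * p0,                        # leave state 0 at total rate 2*lam
            2.0 * lam * c * p0 - lam * p1,          # covered failure in, second failure out
            2.0 * lam * (1.0 - c) * p0 + lam * p1,  # coverage failure or spare exhaustion
        )

    def shifted(p, k, scale):
        return tuple(pi + scale * ki for pi, ki in zip(p, k))

    h = t / steps
    p = (1.0, 0.0, 0.0)  # the system starts in state 0
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(shifted(p, k1, 0.5 * h))
        k3 = deriv(shifted(p, k2, 0.5 * h))
        k4 = deriv(shifted(p, k3, h))
        p = tuple(
            pi + h * (a1 + 2.0 * a2 + 2.0 * a3 + a4) / 6.0
            for pi, a1, a2, a3, a4 in zip(p, k1, k2, k3, k4)
        )
    return p
```

The numerical route generalizes to chains that have no convenient closed form, at the cost of one differential equation per state.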

In Example 1, we note that the system reaches the failure
state labelled 2 due to two distinct causes: exhaustion of spares, and
coverage failure. For the subsequent discussion, we wish to separate
the probabilities due to these two causes of failure.

Example 2: We reformulate the state diagram of Figure 2 so that the
system has five states: in state 0 the system is
functioning properly without any unit failing, in state 1G
the system is functioning properly with a prior (covered)
failure of one of the units, in state 1F the system has
failed due to the occurrence of one uncovered failure, in
state 2F the system has experienced two failures, in state
2G the system has experienced two failures (both covered)
but the system has failed due to exhaustion of spares. The
reformulated state diagram is shown in Figure 3.

The transition from state 1F to 2F may appear
strange but it is very convenient. As we shall see in the
next example, if we delete this transition, the state
probabilities will change but the system reliability will be
the same.

Figure 3. - Reformulated Figure 2 State Diagram

We will let

$$P_i(t) = P_{iG}(t), \text{ and } Q_i(t) = P_{iF}(t).$$

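Solving for the state probabilities, as in Example 1, we have

$$P_0(t) = e^{-2\lambda t},$$

$$P_1(t) = \int_0^t P_0(x)\, 2\lambda c\, e^{-\lambda (t-x)}\, dx = 2c\, [e^{-\lambda t} - e^{-2\lambda t}],$$

$$Q_1(t) = \int_0^t P_0(x)\, 2\lambda (1-c)\, e^{-\lambda (t-x)}\, dx = 2(1-c)\, [e^{-\lambda t} - e^{-2\lambda t}],$$

$$P_2(t) = \int_0^t P_1(x)\, \lambda\, dx = c - 2c\, e^{-\lambda t} + c\, e^{-2\lambda t}, \text{ and}$$

$$Q_2(t) = \int_0^t Q_1(x)\, \lambda\, dx = (1-c) - 2(1-c)\, e^{-\lambda t} + (1-c)\, e^{-2\lambda t}.$$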


Note that the system reliability is given by

$$R(t) = \sum_{k \in L} P_k(t) = P_0(t) + P_1(t) = 2c\, e^{-\lambda t} + (1-2c)\, e^{-2\lambda t},$$

as in Example 1 (Expression (2)).
Computing

$$P_0^*(t) = P_0(t) = e^{-2\lambda t},$$

$$P_1^*(t) = P_1(t) + Q_1(t) = 2\, [e^{-\lambda t} - e^{-2\lambda t}], \text{ and}$$

$$P_2^*(t) = P_2(t) + Q_2(t) = 1 - 2\, e^{-\lambda t} + e^{-2\lambda t},$$

we note that $P_i^*(t)$ is independent of the coverage factor c. But
this should not be surprising if we redraw the state diagram of
Figure 3 by aggregating states iG and iF into the state i to
obtain the state diagram of Figure 4. Here all transition rates
are independent of c.

Figure 4. - Aggregated Figure 3 State Diagram.

Interpreting this diagram as the
fictitious situation of perfect coverage, we conclude that $P_i^*(t)$
represents the probability that the system has sustained i faults
by time t, assuming coverage to be perfect. This reinforces our
earlier interpretation of $P_i(t)$ and $Q_i(t)$.

#
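
The independence of the aggregated probabilities from c is easy to confirm numerically. The following sketch evaluates the closed-form state probabilities of Example 2 and sums the G and F states at each fault level; the function and key names are illustrative assumptions.

```python
import math


def example2_state_probabilities(lam, c, t):
    """Closed-form state probabilities of the five-state chain of Example 2."""
    e1 = math.exp(-lam * t)
    e2 = math.exp(-2.0 * lam * t)
    return {
        "P0": e2,
        "P1": 2.0 * c * (e1 - e2),          # one covered failure (state 1G)
        "Q1": 2.0 * (1.0 - c) * (e1 - e2),  # one uncovered failure (state 1F)
        "P2": c * (1.0 - 2.0 * e1 + e2),    # two failures via state 1G (state 2G)
        "Q2": (1.0 - c) * (1.0 - 2.0 * e1 + e2),  # two failures via state 1F (state 2F)
    }


def perfect_coverage_view(lam, c, t):
    """Aggregate states iG and iF: returns the probabilities of 0, 1, and 2 faults."""
    p = example2_state_probabilities(lam, c, t)
    return p["P0"], p["P1"] + p["Q1"], p["P2"] + p["Q2"]
```

Calling `perfect_coverage_view` with different values of `c` returns the same triple, which is the numerical counterpart of the aggregated diagram having no c-dependent rates.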


Example 3: Let us remove the assumption that a failure is allowed to
occur in another module after an uncovered failure has
occurred in some module. Therefore, we redraw the state
diagram of Figure 3 as shown in Figure 5.

Figure 5. - Redrawn Figure 3 State Diagram.

Computing state probabilities, we get

$$P_0(t) = e^{-2\lambda t},$$

$$P_1(t) = 2c\, (e^{-\lambda t} - e^{-2\lambda t}).$$

Note that

$$R(t) = P_0(t) + P_1(t) = 2c\, e^{-\lambda t} + (1-2c)\, e^{-2\lambda t}$$

which is identical to the result obtained in Examples 1 and 2
(Expression (2)). However, we can no longer interpret
$Q_j(t) + P_j(t) = P_j^*(t)$ as the probability of being in state j at
time t were the coverage perfect.

#

One problem with models of Examples 1-3 is that the value of
the coverage parameter c is assumed to be known (specified by the
model user). In practice, however, such parameters are extremely
difficult to estimate. The extreme sensitivity of the reliability to
the coverage parameter [19] further compounds this problem. It is
imperative, therefore, to provide a method of estimating coverage
parameters based on more elementary, easier to specify, parameters.

Example 4: Consider the Markov model of the fault recovery
process [20] shown in Figure 6. The model consists of five
states. In the active state A, the fault is capable of
producing errors at the rate $\rho$ leading to the error state E.
The fault is assumed to be an intermittent type so that
occasionally it goes into the benign state B, where the
affected circuitry temporarily functions correctly. In
state D the fault has been detected while in state F, an
undetected error has propagated so that we declare the
system to have failed.

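To illustrate the point of this example, it is more
convenient to use the Laplace transform method to solve the
differential equations for the Markov chain. With the states
A, B, E, D, and F numbered 1 through 5, respectively, the
differential equations are:

$$\begin{aligned}
\frac{dP_1}{dt} &= -(\alpha+\rho+\delta)\, P_1(t) + \beta\, P_2(t), \\
\frac{dP_2}{dt} &= -\beta\, P_2(t) + \alpha\, P_1(t), \\
\frac{dP_3}{dt} &= -\epsilon\, P_3(t) + \rho\, P_1(t), \\
\frac{dP_4}{dt} &= \delta\, P_1(t) + q\, \epsilon\, P_3(t), \text{ and} \\
\frac{dP_5}{dt} &= (1-q)\, \epsilon\, P_3(t).
\end{aligned}$$

Applying Laplace transforms and recalling that state 1 is the
initial state, we get ($\bar{P}_i(s)$ denotes the Laplace transform of
$P_i(t)$):

$$s\, \bar{P}_1(s) - 1 = -(\alpha+\rho+\delta)\, \bar{P}_1(s) + \beta\, \bar{P}_2(s)$$

or

$$\begin{aligned}
\bar{P}_1(s) &= \frac{\beta\, \bar{P}_2(s) + 1}{s+\alpha+\rho+\delta}, \\
\bar{P}_2(s) &= \frac{\alpha\, \bar{P}_1(s)}{s+\beta}, \\
\bar{P}_3(s) &= \frac{\rho\, \bar{P}_1(s)}{s+\epsilon}, \\
\bar{P}_4(s) &= \frac{\delta\, \bar{P}_1(s) + q\, \epsilon\, \bar{P}_3(s)}{s}, \text{ and} \\
\bar{P}_5(s) &= \frac{(1-q)\, \epsilon\, \bar{P}_3(s)}{s}.
\end{aligned}$$

Hence,

$$\bar{P}_1(s) = \frac{1}{s + (\alpha+\rho+\delta) - \dfrac{\alpha\beta}{s+\beta}}$$

and

$$\bar{P}_4(s) = \frac{1}{s} \left( \delta + \frac{q\, \epsilon\, \rho}{s+\epsilon} \right) \frac{1}{(s+\alpha+\rho+\delta) - \dfrac{\alpha\beta}{s+\beta}}.$$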


Although it is possible to invert this transform to obtain the
probability of detection by time t, $P_4(t)$, we will be content
here with finding the limiting probability by using the Final
Value Theorem of Laplace Transform:

$$\lim_{t \to \infty} P_4(t) = \lim_{s \to 0} s\, \bar{P}_4(s)
= (\delta + \rho q)\, \frac{1}{(\alpha+\rho+\delta) - \alpha}
= \frac{\rho q + \delta}{\delta+\rho}.$$

This represents the probability that the fault (which occurred at
time t=0) eventually is detected. It is for this reason that we
conclude that the coverage factor for this fault model is given
by

$$c = \lim_{t \to \infty} P_4(t) = \frac{\delta + \rho q}{\delta+\rho}. \tag{3}$$

Similarly, $(1-c) = \lim_{t \to \infty} P_5(t) = \dfrac{\rho (1-q)}{\delta+\rho}$.

#
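
The limiting value in Expression (3) can be checked by integrating the five differential equations of Example 4 until the transient has died out. The sketch below is an illustration only: the function names, the Runge-Kutta integrator, the step count, and the parameter values used when calling it are assumptions, not part of the CARE III program.

```python
def coverage_factor(delta, rho, q):
    """Expression (3): c = (delta + rho*q) / (delta + rho)."""
    return (delta + rho * q) / (delta + rho)


def detection_probability(alpha, beta, delta, rho, eps, q, t, steps=20000):
    """RK4 integration of the five-state fault recovery model; returns P4(t)."""

    def deriv(p):
        p1, p2, p3, _, _ = p
        return (
            -(alpha + rho + delta) * p1 + beta * p2,  # active state A
            -beta * p2 + alpha * p1,                  # benign state B
            -eps * p3 + rho * p1,                     # error state E
            delta * p1 + q * eps * p3,                # detected state D
            (1.0 - q) * eps * p3,                     # failed state F
        )

    def shifted(p, k, scale):
        return tuple(pi + scale * ki for pi, ki in zip(p, k))

    h = t / steps
    p = (1.0, 0.0, 0.0, 0.0, 0.0)  # the fault starts in the active state
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(shifted(p, k1, 0.5 * h))
        k3 = deriv(shifted(p, k2, 0.5 * h))
        k4 = deriv(shifted(p, k3, h))
        p = tuple(
            pi + h * (a1 + 2.0 * a2 + 2.0 * a3 + a4) / 6.0
            for pi, a1, a2, a3, a4 in zip(p, k1, k2, k3, k4)
        )
    return p[3]
```

Note that the benign-state rates `alpha` and `beta` affect how quickly detection occurs but drop out of the limiting value, in agreement with Expression (3).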


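Example 5: The results of Example 4 can be used to calculate the
coverage factor c, and subsequently this value of c can be
used in the computations of Example 1 (or 2 or 3) in order
to evaluate the system reliability. Now the user can
specify elemental quantities $\delta, \rho, q, \alpha, \beta$ as needed in the
coverage model calculations. Thus, for instance, the
reliability model of Example 2 in conjunction with the
coverage model of Example 4 gives the following state
probabilities:

$$P_0(t) = e^{-2\lambda t},$$

$$P_1(t) = 2\, \frac{\delta + \rho q}{\delta+\rho}\, [e^{-\lambda t} - e^{-2\lambda t}], \tag{4}$$

$$Q_1(t) = 2\, \frac{\rho (1-q)}{\delta+\rho}\, [e^{-\lambda t} - e^{-2\lambda t}], \tag{5}$$

$$P_2(t) = \frac{\delta + \rho q}{\delta+\rho}\, [1 - 2 e^{-\lambda t} + e^{-2\lambda t}], \tag{6}$$

$$Q_2(t) = \frac{\rho (1-q)}{\delta+\rho}\, [1 - 2 e^{-\lambda t} + e^{-2\lambda t}], \tag{7}$$

and system reliability

$$R(t) = P_0(t) + P_1(t) = e^{-2\lambda t} + 2\, \frac{\delta + \rho q}{\delta+\rho}\, [e^{-\lambda t} - e^{-2\lambda t}]. \tag{8}$$

#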

Use of the hierarchical approach to reliability evaluation 
described so far is supported by a realization that holding times in 
various states of the coverage model (of Example 4) will be several 
orders of magnitude smaller than those in the fault-occurrence models 
(Examples 1-3) . Nevertheless, this approach yields only a first-order 
approximation to the more accurate model that we wish to study. 



Figure 7. - Markov Model of a Two-Unit System. 


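Example 6: Consider the Markov reliability model of a 2-unit system as
shown in Figure 7. Solving for the state probabilities of
this Markov chain using the convolution integration approach
we get (for the sake of simplicity we assume that
$\alpha = \beta = 0$):

$$P_0(t) = e^{-2\lambda t},$$

$$P_{(1A)}(t) = \int_0^t P_0(x)\, 2\lambda\, e^{-(\lambda+\delta+\rho)(t-x)}\, dx
= \frac{2\lambda}{\lambda-(\delta+\rho)}\, e^{-(\delta+\rho+\lambda)t} - \frac{2\lambda}{\lambda-(\delta+\rho)}\, e^{-2\lambda t},$$

$$\begin{aligned}
P_{(1E)}(t) &= \int_0^t P_{(1A)}(x)\, \rho\, e^{-(\lambda+\epsilon)(t-x)}\, dx \\
&= \frac{2\lambda\rho\, e^{-(\lambda+\epsilon)t}}{(\delta+\rho-\epsilon)(\lambda-\epsilon)}
 - \frac{2\lambda\rho\, e^{-(\lambda+\delta+\rho)t}}{(\lambda-(\delta+\rho))(\delta+\rho-\epsilon)}
 + \frac{2\lambda\rho\, e^{-2\lambda t}}{(\lambda-(\delta+\rho))(\lambda-\epsilon)},
\end{aligned}$$

$$\begin{aligned}
P_{(1D)}(t) &= \int_0^t P_{(1A)}(x)\, \delta\, e^{-\lambda(t-x)}\, dx + \int_0^t P_{(1E)}(x)\, q\, \epsilon\, e^{-\lambda(t-x)}\, dx \\
&= \frac{2(\delta+\rho q)}{\delta+\rho}\, e^{-\lambda t}
 + \left[ \frac{2\lambda \rho q \epsilon}{(\delta+\rho)(\delta+\rho-\epsilon)(\lambda-(\delta+\rho))} - \frac{2\lambda\delta}{(\lambda-(\delta+\rho))(\delta+\rho)} \right] e^{-(\lambda+\delta+\rho)t} \\
&\quad - \frac{2\lambda \rho q\, e^{-(\lambda+\epsilon)t}}{(\delta+\rho-\epsilon)(\lambda-\epsilon)}
 - \left[ \frac{2(\delta+\rho q)\,\epsilon}{(\lambda-\epsilon)(\lambda-(\delta+\rho))} - \frac{2\delta\lambda}{(\lambda-(\delta+\rho))(\lambda-\epsilon)} \right] e^{-2\lambda t},
\end{aligned}$$

$$\begin{aligned}
P_{(1F)}(t) &= \int_0^t P_{(1E)}(x)\, (1-q)\, \epsilon\, e^{-\lambda(t-x)}\, dx \\
&= \frac{2\rho(1-q)}{\delta+\rho}\, e^{-\lambda t}
 - \frac{2\lambda(1-q)\rho}{(\delta+\rho-\epsilon)(\lambda-\epsilon)}\, e^{-(\lambda+\epsilon)t}
 + \frac{2\lambda\rho(1-q)\,\epsilon\, e^{-(\lambda+\delta+\rho)t}}{(\lambda-(\delta+\rho))(\delta+\rho-\epsilon)(\delta+\rho)} \\
&\quad - \frac{2\rho(1-q)\,\epsilon\, e^{-2\lambda t}}{(\lambda-\epsilon)(\lambda-(\delta+\rho))},
\end{aligned}$$

$$\begin{aligned}
P_{(2F)}(t) &= \int_0^t P_{(1F)}(x)\, \lambda\, dx \\
&= \frac{\rho\epsilon(1-q)}{(\lambda+\rho+\delta)(\lambda+\epsilon)}
 - \frac{2\rho(1-q)}{\rho+\delta}\, e^{-\lambda t}
 - \frac{2\rho\lambda^2\epsilon(1-q)\, e^{-(\lambda+\rho+\delta)t}}{(\lambda+\rho+\delta)(\lambda-(\rho+\delta))(\rho+\delta-\epsilon)(\rho+\delta)} \\
&\quad + \frac{2\rho\lambda^2(1-q)\, e^{-(\lambda+\epsilon)t}}{(\lambda^2-\epsilon^2)(\rho+\delta-\epsilon)}
 + \frac{\rho(1-q)\,\epsilon\, e^{-2\lambda t}}{(\lambda-\epsilon)(\lambda-(\rho+\delta))},
\end{aligned}$$

and

$$P_{(2G)}(t) = \int_0^t \lambda\, [P_{(1A)}(x) + P_{(1D)}(x) + P_{(1E)}(x)]\, dx.$$
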
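In our earlier terminology, we are now in a position
to compute $Q_1(t) = P_{(1F)}(t)$:

$$Q_1(t) = \frac{2\rho(1-q)}{\delta+\rho} \left[ e^{-\lambda t} - \frac{\epsilon(\delta+\rho)}{(\lambda-\epsilon)(\lambda-(\delta+\rho))}\, e^{-2\lambda t} \right]
 - \frac{2\lambda\rho(1-q)}{\delta+\rho-\epsilon} \left[ \frac{e^{-(\lambda+\epsilon)t}}{\lambda-\epsilon} - \frac{\epsilon\, e^{-(\lambda+\delta+\rho)t}}{[\lambda-(\delta+\rho)](\delta+\rho)} \right]. \tag{9}$$
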

Comparing Expression (9) with the earlier Expression (5)
derived in Example 5, we note that if we let $\lambda/\epsilon$ and $\lambda/(\delta+\rho)$
approach zero while keeping the individual terms non-zero,
the two expressions become identical in the limit. A
similar argument will show that in the limit all state
probabilities derived in the present example reduce to those
derived in Example 5. Thus indeed, the approach in Example 5
is a first-order approximation to the exact solution
derived above. For instance let us compute

$$\begin{aligned}
P_1(t) &= P_{(1A)}(t) + P_{(1B)}(t) + P_{(1D)}(t) + P_{(1E)}(t) \\
&= \frac{2(\delta+\rho q)}{\delta+\rho}\, e^{-\lambda t}
 - \left[ \frac{2\,[(\lambda-\epsilon)(\lambda-\delta) - \rho(\lambda-q\epsilon)]}{(\lambda-\epsilon)(\lambda-(\rho+\delta))} \right] e^{-2\lambda t}
 + \frac{2\lambda\rho(1-q)\, e^{-(\lambda+\epsilon)t}}{(\delta+\rho-\epsilon)(\lambda-\epsilon)} \\
&\quad + \frac{2\lambda\epsilon\rho(q-1)}{(\delta+\rho)(\delta+\rho-\epsilon)(\lambda-(\delta+\rho))}\, e^{-(\lambda+\delta+\rho)t}.
\end{aligned} \tag{10}$$


In modeling ultra-high reliability systems the 
approximate approach of Example 5 may not be adequate but, 
at the same time, the exact approach of Example 6 can become 
unmanageable when we consider systems with hundreds or 
thousands of modules. We therefore need to pursue 
decomposition approaches which are more accurate than the 
first-order approach of Example 5, yet more manageable than 
the exact solution of Example 6. CARE III provides one such 
approach to handling reliability models with an extremely 
large number of states. 
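
To give a feel for how close the first-order approach is to the exact one, the sketch below evaluates Expression (5) and Expression (9) side by side. It assumes $\alpha = \beta = 0$ and pairwise distinct values of $\lambda$, $\epsilon$, and $\delta+\rho$ (the closed form divides by their differences); the function names and the parameter values used in any call are hypothetical.

```python
import math


def q1_first_order(lam, delta, rho, eps, q, t):
    """Expression (5): coverage-failure probability from the hierarchical model."""
    b = delta + rho
    return 2.0 * rho * (1.0 - q) / b * (math.exp(-lam * t) - math.exp(-2.0 * lam * t))


def q1_exact(lam, delta, rho, eps, q, t):
    """Expression (9): the same probability from the full chain (alpha = beta = 0)."""
    b = delta + rho
    slow = 2.0 * rho * (1.0 - q) / b * (
        math.exp(-lam * t)
        - eps * b / ((lam - eps) * (lam - b)) * math.exp(-2.0 * lam * t)
    )
    fast = 2.0 * lam * rho * (1.0 - q) / (b - eps) * (
        math.exp(-(lam + eps) * t) / (lam - eps)
        - eps * math.exp(-(lam + b) * t) / ((lam - b) * b)
    )
    return slow - fast


def relative_gap(lam, delta, rho, eps, q, t):
    """Relative difference between the first-order and exact values of Q1(t)."""
    exact = q1_exact(lam, delta, rho, eps, q, t)
    return abs(q1_first_order(lam, delta, rho, eps, q, t) - exact) / abs(exact)
```

When the fault-handling rates are several orders of magnitude larger than `lam`, `relative_gap` becomes small, which is the limiting behavior described above; for comparable rates the gap is substantial.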


Example 7: Continuing with the Markov chain of Example 6, suppose we
wish to suppress all the details of various states of the
coverage model, and, with a given number of faults in the
system, we consider only 2 states: the system has
experienced a coverage failure or it has not. In the
specific case at hand, we aggregate the states 1A, 1B, 1D,
and 1E into a single state 1G as shown in Figure 8.

Figure 8. - State Aggregation for Example 6.

In order to complete the specification of this
reduced Markov chain, we need to specify the transition
parameter a'(t). We shall see shortly that this parameter
is time dependent (as indicated by the notation) and hence
the Markov chain of Figure 8 is non-homogeneous.

The easiest way to compute a'(t) is to refer back to
the solution of the Markov chain of Figure 7:

$$a'(t) = \frac{(1-q)\, \epsilon\, P_{(1E)}(t)}{P_{(1A)}(t) + P_{(1B)}(t) + P_{(1D)}(t) + P_{(1E)}(t)}.$$




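In other words, to compute the effective transition rate
from an aggregate state, we sum the transition rate times
the state probability of each state contributing to the
outward flow in the non-reduced model and divide the sum by
the total probability of being in any one of the aggregated
states. In the present case, we get

$$a'(t) = \frac{(1-q)\,\epsilon \left[ \dfrac{2\lambda\rho\, e^{-(\lambda+\epsilon)t}}{(\delta+\rho-\epsilon)(\lambda-\epsilon)} - \dfrac{2\lambda\rho\, e^{-(\lambda+\delta+\rho)t}}{(\lambda-(\delta+\rho))(\delta+\rho-\epsilon)} + \dfrac{2\lambda\rho\, e^{-2\lambda t}}{(\lambda-(\delta+\rho))(\lambda-\epsilon)} \right]}{P_1(t)}$$

where $P_1(t)$ is given by (10).

Dividing the numerator and denominator by $2 e^{-\lambda t}$, we obtain

$$a'(t) = \frac{(1-q)\,\epsilon \left[ \dfrac{\lambda\rho\, e^{-\epsilon t}}{(\delta+\rho-\epsilon)(\lambda-\epsilon)} - \dfrac{\lambda\rho\, e^{-(\delta+\rho)t}}{(\lambda-(\delta+\rho))(\delta+\rho-\epsilon)} + \dfrac{\lambda\rho\, e^{-\lambda t}}{(\lambda-(\delta+\rho))(\lambda-\epsilon)} \right]}
{\dfrac{\delta+\rho q}{\delta+\rho} - \left[ \dfrac{(\lambda-\epsilon)(\lambda-\delta) - \rho(\lambda-q\epsilon)}{(\lambda-\epsilon)(\lambda-(\rho+\delta))} \right] e^{-\lambda t} + \dfrac{\lambda\rho(1-q)}{(\rho+\delta-\epsilon)(\lambda-\epsilon)}\, e^{-\epsilon t} + \left[ \dfrac{\lambda\epsilon\rho(q-1)}{(\delta+\rho)(\delta+\rho-\epsilon)(\lambda-(\delta+\rho))} \right] e^{-(\rho+\delta)t}} \tag{11}$$

where, again, $\alpha = \beta = 0$.

#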
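
Expression (11) is straightforward to evaluate directly. The sketch below does so for $t > 0$, assuming $\alpha = \beta = 0$ and pairwise distinct $\lambda$, $\epsilon$, and $\delta+\rho$; the function name and the sample parameter values in any call are illustrative assumptions.

```python
import math


def aggregate_failure_rate(lam, delta, rho, eps, q, t):
    """Expression (11) for a'(t); valid for t > 0 with alpha = beta = 0."""
    b = delta + rho
    # Numerator: (1 - q) * eps times the probability of state 1E, scaled by 2 e^{-lam t}.
    numerator = (1.0 - q) * eps * lam * rho * (
        math.exp(-eps * t) / ((b - eps) * (lam - eps))
        - math.exp(-b * t) / ((lam - b) * (b - eps))
        + math.exp(-lam * t) / ((lam - b) * (lam - eps))
    )
    # Denominator: probability of the aggregate state 1G, with the same scaling.
    denominator = (
        (delta + rho * q) / b
        - ((lam - eps) * (lam - delta) - rho * (lam - q * eps))
        / ((lam - eps) * (lam - b)) * math.exp(-lam * t)
        + lam * rho * (1.0 - q) / ((b - eps) * (lam - eps)) * math.exp(-eps * t)
        + lam * eps * rho * (q - 1.0) / (b * (b - eps) * (lam - b)) * math.exp(-b * t)
    )
    return numerator / denominator
```

Because a'(t) is the rate $(1-q)\epsilon$ weighted by the conditional probability of being in the error state, its value always lies between 0 and $(1-q)\epsilon$, and it varies with t even though every rate in the original chain is constant.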


Now all the transition parameters of the non-homogeneous
Markov chain of Figure 8 have been obtained and hence the state
probabilities can easily be found using standard methods (to be
described in the next section). However, there is a catch in this
procedure! Before we solve the reduced model of Figure 8, we must
first solve the full model of Example 6 in order to obtain the
transition parameter a'(t)! Nothing seems to be gained by the process
of reduction. The answer to this objection is that the computation of
a'(t) can be carried out without solving the full reliability model.
In fact, in computing a'(t) we need only to look at a very simple
coverage model (with 5 states in the present case) and then solve the
fault model of Figure 8 (also with 5 states in the present case).
This computation replaces the earlier computation based on the model
of Figure 7 (with 8 states).

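Example 8: We now proceed to illustrate the computation of a'(t) using
the coverage model of Figure 9. We note that the coverage
model will be entered subsequent to a failure in one of the
two modules at some time $\tau$. Let $p_j(t-\tau)$ denote the
probability of being in state $j \in \{A,B,D,E,F\}$ at time t
given that the coverage model was entered at time $\tau$. We
compute these probabilities using the convolution
integration approach (again with $\alpha = \beta = 0$):

Figure 9. - Coverage Model for a'(t) Computation.

$$p_E(t-\tau) = \int_0^{t-\tau} p_A(x)\, \rho\, e^{-\epsilon(t-\tau-x)}\, dx
= \frac{\rho}{\delta+\rho-\epsilon} \left( e^{-\epsilon(t-\tau)} - e^{-(\delta+\rho)(t-\tau)} \right),$$

$$\begin{aligned}
p_D(t-\tau) &= \int_0^{t-\tau} \left( p_A(x)\, \delta + p_E(x)\, q\, \epsilon \right) dx \\
&= \frac{\delta+\rho q}{\delta+\rho} - \frac{\rho q}{\delta+\rho-\epsilon}\, e^{-\epsilon(t-\tau)}
 + \frac{\epsilon(\delta+\rho q) - \delta(\delta+\rho)}{(\delta+\rho)(\delta+\rho-\epsilon)}\, e^{-(\delta+\rho)(t-\tau)},
\end{aligned}$$

and

$$\begin{aligned}
p_F(t-\tau) &= \int_0^{t-\tau} p_E(x)\, (1-q)\, \epsilon\, dx \\
&= \frac{(1-q)\rho}{\rho+\delta} - \frac{(1-q)\rho}{\rho+\delta-\epsilon}\, e^{-\epsilon(t-\tau)}
 + \frac{(1-q)\rho\,\epsilon}{(\rho+\delta-\epsilon)(\rho+\delta)}\, e^{-(\rho+\delta)(t-\tau)}.
\end{aligned}$$
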

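In order to compute a'(t), we note that

$$\begin{aligned}
a'(t) &= \frac{(1-q)\,\epsilon \cdot \text{Prob. of being in state E at time } t}{\text{Prob. of being in one of the states } \{A,B,D,E\} \text{ at time } t} \\
&= \frac{(1-q)\,\epsilon \int_0^t p_E(t-\tau)\, P(\text{cov. model entered in the interval } (\tau,\tau+d\tau))}{\int_0^t [p_A(t-\tau) + p_B(t-\tau) + p_D(t-\tau) + p_E(t-\tau)]\, P(\text{cov. model entered in the interval } (\tau,\tau+d\tau))} \\
&= \frac{(1-q)\,\epsilon \int_0^t p_E(t-\tau)\, \lambda e^{-\lambda\tau}\, d\tau}{\int_0^t (1 - p_F(t-\tau))\, \lambda e^{-\lambda\tau}\, d\tau}
= \frac{(1-q)\,\epsilon\, N}{D}.
\end{aligned}$$

First evaluate the numerator N,

$$N = \frac{\lambda\rho\, e^{-\epsilon t}}{(\delta+\rho-\epsilon)(\lambda-\epsilon)} + \frac{\lambda\rho\, e^{-\lambda t}}{(\lambda-\epsilon)(\lambda-(\delta+\rho))} - \frac{\lambda\rho\, e^{-(\delta+\rho)t}}{(\delta+\rho-\epsilon)(\lambda-(\delta+\rho))}.$$

In a similar fashion, D is computed and to our pleasant surprise,
we find that the ratio exactly matches with our earlier
expression for a'(t) given by (11).

#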
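
The convolution route of Example 8 can also be carried out numerically, which is closer to what is needed when the coverage model has no convenient closed form. The sketch below uses the closed-form $p_E$ and $p_F$ of the coverage model (with $\alpha = \beta = 0$), weights them by the entry density $\lambda e^{-\lambda\tau}$, and integrates with Simpson's rule. The function names, the quadrature, and the step count are assumptions made for illustration.

```python
import math


def coverage_probs(delta, rho, eps, q, u):
    """Return (p_E(u), p_F(u)) for the coverage model, u = time since entry."""
    b = delta + rho
    p_e = rho / (b - eps) * (math.exp(-eps * u) - math.exp(-b * u))
    p_f = (1.0 - q) * rho * (
        1.0 / b
        - math.exp(-eps * u) / (b - eps)
        + eps * math.exp(-b * u) / ((b - eps) * b)
    )
    return p_e, p_f


def simpson(f, lo, hi, steps=2000):
    """Composite Simpson's rule on [lo, hi]; steps must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * f(lo + k * h)
    return total * h / 3.0


def aggregate_rate_by_convolution(lam, delta, rho, eps, q, t):
    """a'(t) = (1 - q) * eps * N / D with N and D computed by numerical convolution."""

    def entry_density(tau):
        return lam * math.exp(-lam * tau)

    n = simpson(
        lambda tau: coverage_probs(delta, rho, eps, q, t - tau)[0] * entry_density(tau),
        0.0, t,
    )
    d = simpson(
        lambda tau: (1.0 - coverage_probs(delta, rho, eps, q, t - tau)[1]) * entry_density(tau),
        0.0, t,
    )
    return (1.0 - q) * eps * n / d
```

Only the five-state coverage model enters this computation; the eight-state chain of Example 6 is never solved, which is the point made in the discussion preceding this example.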

The method described in the last example can be extended to 
more complex coverage models, and the results of the coverage model 
calculations can then be plugged into the overall reliability model 
which will necessarily be a non-homogeneous Markov chain. These 
extensions are developed in the next section. 

4. CARE III Model Development

As pointed out in the last section, two major concerns with 
any advanced reliability prediction model are: 

1) the problem of very large state spaces, and 

2) the desire to include non-exponential holding times. 

The CARE III approach to the first problem is the state aggregation 
(or decomposition) method, and the approach to the second uses a 
combination of semi-Markov techniques (at the coverage model level) 
and time-dependent transition parameters resulting in a
non-homogeneous Markov chain (at the aggregate model level).

As noted earlier, non-exponential holding times within the
coverage model are handled using the sample path enumeration method.
Let us examine the approach of non-homogeneous Markov chains used in 
CARE III to deal with non-exponential holding times in states outside 
the coverage model. As seen in Example 7, even if all holding times 
are assumed to be exponentially distributed in the original model, 
derived transition parameters of the aggregate model are time 
dependent, hence the temptation occurs to use time-dependent 
transition parameters to model non-exponential holding times in the 
fault-occurrence model. 

One problem occurs in using this approach. The time 
dependency of transition parameters can be easily handled, provided 
the time is measured from the beginning of system operation (global 
time). However, non-exponential holding times in a state naturally
give rise to time-dependent transition parameters associated with all
arcs emanating from the state, with time measured from the point of
entry into that state (local time). Suppose, for example, we wish to
model the holding time in state i to be Weibull distributed with the 
hazard rate $\lambda(\tau) = a\tau^{b}$ and suppose there is only one transition out
of state i to state j; then we must label the (i,j) transition with
parameter $\lambda_{ij}(\tau) = a\tau^{b}$ where time $\tau$ is measured from the time of the
last entry into state i. Now the global time t is related to $\tau$ by
$t = T_i + \tau$ where $T_i$ is the global time to the last entry into state i.
Note that $T_i$ is a random variable and hence a fixed time translation
will not suffice, in general.

The argument to be used here in favor of CARE III is that all
failure processes can be assumed to start at the beginning of system
operation; hence, the global time can be used to assign time-dependent
transition rates to all arcs due to failure events. Of course, this
argument breaks down if renewals (repairs) take place. However, as
per the interpretation in Section 2 (Figures 1(d) and 1(e)), a non-unity
dormancy factor (that is, spare failure rate being different from
active failure rate) can be handled.


We will develop the general approach to non-homogeneous Markov 
chains and its use in the CARE III model in the next three 
subsections. 

4.1 Non-Homogeneous Markov Chains 

Consider a discrete-state continuous-parameter Markov chain 
$\{X(t),\ t \ge 0\}$. Let the transition probabilities be 

      $p_{ij}(v,t) = P(X(t) = j \mid X(v) = i)$ 

for $0 \le v \le t$ and $i, j = 0, 1, 2, \ldots$, where we define 

      $p_{ij}(t,t) = 1$ if $i = j$, and $0$ otherwise. 

The Markov chain $\{X(t),\ t \ge 0\}$ is said to be 
(time-)homogeneous (or is said to have stationary transition 
probabilities) if $p_{ij}(v,t)$ depends only on the time difference $(t-v)$. 
Let us denote the state probabilities at time t by 

      $P_k(t) = P(X(t) = k)$, $k = 0, 1, 2, \ldots$ and $t \ge 0$. 

We assume that state 0 is the initial state and hence, 

      $P_k(t) = p_{0k}(0,t)$. 

The transition probabilities of a Markov chain $\{X(t),\ t \ge 0\}$ 
satisfy the Chapman-Kolmogorov Equation [21]; for all i, j in the 
state space, 

(12)  $p_{ij}(v,t) = \sum_k p_{ik}(v,u)\, p_{kj}(u,t)$, $\quad 0 \le v < u < t$. 


The direct use of (12) is difficult. The state probabilities 
are usually obtained by solving a system of differential equations 
that we derive next. Under certain regularity conditions, we can show 
that for each j there is a non-negative continuous function $q_j(t)$ 
defined by 

      $q_j(t) = \lim_{h \to 0} \frac{p_{jj}(t,t) - p_{jj}(t,t+h)}{h} = \lim_{h \to 0} \frac{1 - p_{jj}(t,t+h)}{h}$. 

Similarly for each i and j ($\ne$ i) there is a non-negative 
continuous function $q_{ij}(t)$ defined by 

      $q_{ij}(t) = \lim_{h \to 0} \frac{p_{ij}(t,t+h) - p_{ij}(t,t)}{h} = \lim_{h \to 0} \frac{p_{ij}(t,t+h)}{h}$. 

Then the transition probabilities and the transition rates are related 
by: 

      $p_{ij}(t,t+h) = q_{ij}(t)\,h + o(h)$, $\quad i \ne j$ 

and 

      $p_{jj}(t,t+h) = 1 - q_j(t)\,h + o(h)$, $\quad i = j$. 

Using (12), it is possible to obtain a differential equation for the 
state probability $P_j(t)$: 

(13)  $\frac{dP_j(t)}{dt} = -q_j(t)\,P_j(t) + \sum_{i \ne j} q_{ij}(t)\,P_i(t)$. 

The linear first-order differential equation is easily solved 
using standard calculus techniques [22, pp. 53-57] to obtain the 
convolution integral form of $P_j(t)$ (analogous to Equation (1) for the 
homogeneous case): 

(14)  $P_j(t) = P_j(0)\,e^{-\int_0^t q_j(\tau)\,d\tau} + \sum_{i \ne j} \int_0^t P_i(x)\,q_{ij}(x)\,e^{-\int_x^t q_j(\tau)\,d\tau}\,dx$. 

The first term on the right-hand side will be zero for all but the 
initial state. 


Here $o(h)$ is any function of h that approaches zero faster than h: 

      $\lim_{h \to 0} \frac{o(h)}{h} = 0$. 

Example 9: Consider the slightly general version of the Markov chain 
in Figure 4, shown in Figure 10. Applying Equation (14), we 
get various state probabilities: 

      $P_0^*(t) = e^{-2\int_0^t \lambda(\tau)\,d\tau}$, 

      $P_1^*(t) = \int_0^t P_0^*(x)\,2\lambda(x)\,e^{-\int_x^t \lambda(\tau)\,d\tau}\,dx$, and 

      $P_2^*(t) = \int_0^t P_1^*(x)\,\lambda(x)\,dx$. 

Given the function $\lambda(t)$, it is possible to numerically evaluate 
the state probabilities in order $P_0^*(t)$, $P_1^*(t)$, $P_2^*(t)$. We will 
assume that our aggregate reliability models will not have any 
renewals or repair type transitions; it will always be possible 
to order the states in this fashion. It should be noted that the 
Markov chain of Figure 10 represents the twice collapsed version 
of the Markov chain of Figure 7. The first level of collapsing 
was done to Figure 8; now if we further collapse states 1G and 1F 
into state 1*, states 2G and 2F into state 2*, and relabel state 
0 as 0*, we obtain the diagram of Figure 10 (albeit with the 
addition of time-dependent transition rates). 

# 



Figure 10. - Generalized Figure 4 Markov Chain. 
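The ordered evaluation of the state probabilities can be sketched numerically. The following minimal example assumes the three-state chain of Figure 10 is left from state 0* at rate 2λ(t) and from state 1* at rate λ(t), with all times measured globally; the function names, the trapezoidal rule and the grid size are illustrative choices, not part of CARE III.

```python
import math


def running_trapezoid(samples, step):
    """Running trapezoidal integral of equally spaced samples (starts at 0)."""
    total = [0.0]
    for left, right in zip(samples, samples[1:]):
        total.append(total[-1] + 0.5 * (left + right) * step)
    return total


def example9_probabilities(hazard, t_end, steps=2000):
    """Return (grid, P0*, P1*, P2*) for the chain 0* -> 1* -> 2* with
    rates 2*hazard(t) and hazard(t), all measured in global time."""
    step = t_end / steps
    grid = [k * step for k in range(steps + 1)]
    lam = [hazard(t) for t in grid]
    cum = running_trapezoid(lam, step)  # integral of the hazard from 0 to t
    p0 = [math.exp(-2.0 * c) for c in cum]
    # P1*(t) = exp(-cum(t)) * integral_0^t P0*(x) 2 lam(x) exp(cum(x)) dx
    inner = running_trapezoid(
        [2.0 * l * p * math.exp(c) for l, p, c in zip(lam, p0, cum)], step)
    p1 = [math.exp(-c) * s for c, s in zip(cum, inner)]
    # P2*(t) = integral_0^t P1*(x) lam(x) dx  (state 2* is absorbing)
    p2 = running_trapezoid([p * l for p, l in zip(p1, lam)], step)
    return grid, p0, p1, p2


# Constant hazard as a sanity case; any finite non-negative function of
# global time may be supplied instead.
grid, p0, p1, p2 = example9_probabilities(lambda t: 0.5, t_end=2.0)
```

With a constant hazard the result can be compared with the familiar closed forms, and the three probabilities sum to one up to discretization error.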
With CARE III models, it is always the case that the 
reliability model under the assumption of perfect coverage will be a 
generalized version of the model in Example 9. Furthermore, we will 
need the state probabilities for the perfect coverage case in order to 
evaluate the state probabilities of the model with imperfect coverage. 



Figure 11. - Perfect Coverage Case. 


In the perfect coverage case, we let $\lambda_{ij}(t)$ denote the 
transition rate from state i to state j (due to a failure event) and 
let $\lambda_i(t) = \sum_j \lambda_{ij}(t)$ (see Figure 11). Then the state 
probabilities are written as: 

(14a)  $P_j^*(t) = P_j^*(0)\,e^{-\int_0^t \lambda_j(\tau)\,d\tau} + \sum_{i \ne j} \int_0^t P_i^*(x)\,\lambda_{ij}(x)\,e^{-\int_x^t \lambda_j(\tau)\,d\tau}\,dx$. 

4.2 Reliability Models with Imperfect Coverage 

The general structure of an aggregate CARE III model is shown 
in Figure 12. The perfect-coverage version of this chain, with states 
jG and jF collapsed into state j*, is shown in Figure 11, where we 
necessarily have 

(15) Xj(t) = 2 Xj,^(t)=9j (t) + S/j,^(t) + S 0j^(t) . 
The $\psi_j(t)$ transitions are due to preexisting latent faults that cause 
a coverage failure without additional faults occurring. The $\theta_{ij}(t)$ 
transitions are due to the occurrence of a fault that, either by itself 
or in conjunction with preexisting latent faults, causes an immediate 
coverage failure. 





Figure 12. - General Structure of CARE III Aggregate Model. 


Using the convolution integration approach (14), we can write 
the state probabilities: 


(16a) 


-|Xj(r)dr 


dx 


and 

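(16b)  $Q_j(t) = P_{jF}(t) = \sum_{i \ne j} \int_0^t Q_i(x)\,\lambda_{ij}(x)\,e^{-\int_x^t \lambda_j(\tau)\,d\tau}\,dx$ 
       $\qquad + \sum_{i \ne j} \int_0^t P_i(x)\,\theta_{ij}(x)\,e^{-\int_x^t \lambda_j(\tau)\,d\tau}\,dx$ 
       $\qquad + \int_0^t P_j(x)\,\psi_j(x)\,e^{-\int_x^t \lambda_j(\tau)\,d\tau}\,dx$. 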
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For numerical reasons, (Qj(t) is typically close to 0, while 

$P_j(t)$ is close to 1) the computations of $Q_j(t)$ (using (16b)) were 
found to be less prone to round-off error accumulation than those of 
$P_j(t)$ (using (16a)). Further, although $Q_j(t)$ depends directly 
upon $P_j(t)$, it has been the experience of the implementors of CARE III 
that replacing $P_j(t)$ by $P_j^*(t)$ in Equation (16b) for $Q_j(t)$ does not 
cause excessive errors, and those introduced are on the conservative 
side of under-estimating system reliability. Therefore, we can write 

(17)  $Q_j(t) \approx \sum_{i \ne j} \int_0^t Q_i(x)\,\lambda_{ij}(x)\,e^{-\int_x^t \lambda_j(\tau)\,d\tau}\,dx$ 
      $\qquad + \sum_{i \ne j} \int_0^t P_i^*(x)\,\theta_{ij}(x)\,e^{-\int_x^t \lambda_j(\tau)\,d\tau}\,dx$ 
      $\qquad + \int_0^t P_j^*(x)\,\psi_j(x)\,e^{-\int_x^t \lambda_j(\tau)\,d\tau}\,dx$. 
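Each term of the approximation for $Q_j(t)$ is a single integral of a perfect-coverage probability times a coverage-failure rate, damped by the exit rate of state j. A minimal sketch of one such term follows, assuming a constant exit rate and using Simpson's rule; the helper names and the numerical values (a constant latent-fault rate in particular) are hypothetical and serve only to allow a closed-form check.

```python
import math


def simpson(func, lower, upper, steps=2000):
    """Composite Simpson's rule; steps must be even."""
    if steps % 2:
        raise ValueError("steps must be even")
    h = (upper - lower) / steps
    total = func(lower) + func(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * func(lower + k * h)
    return total * h / 3.0


def coverage_failure_term(p_star, rate, exit_rate, t):
    """Integral over [0, t] of p_star(x) * rate(x) * exp(-exit_rate * (t - x))."""
    return simpson(
        lambda x: p_star(x) * rate(x) * math.exp(-exit_rate * (t - x)), 0.0, t)


lam, psi_const, t = 1e-4, 1e-3, 100.0  # hypothetical values


def p1_star(x):
    """Perfect-coverage probability that one of two units has failed."""
    return 2.0 * math.exp(-lam * x) * (1.0 - math.exp(-lam * x))


q1 = coverage_failure_term(p1_star, lambda x: psi_const, lam, t)
```

For this constant-rate case the integral reduces to $2c\,e^{-\lambda t}\,(t - (1 - e^{-\lambda t})/\lambda)$ with c the constant rate, which gives a quick check on the quadrature.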


Thus, we first compute $P_j^*(t)$ (perfect-coverage case; using 
Equation (14a)) and then compute $Q_j(t)$ using the above approximation 
(17). The system reliability is given by 

      $R(t) = 1 - \left( \sum_{j \in L} Q_j(t) + \sum_{j \in \bar{L}} P_j(t) \right)$ 

      $\qquad = 1 - \left( \sum_{j \in L} Q_j(t) + \sum_{j \in \bar{L}} P_j^*(t) - \sum_{j \in \bar{L}} Q_j(t) \right)$ 

where L is the set of good (system operational, given perfect 
coverage) states and $\bar{L}$ is the set of bad states. 
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Before calculation of Qj(t) can be carried out, the transition 
parameters (t) , j (t) , /^jCt) , and i)j(t) have to be specified. 

Of these, $\lambda_{ij}(t)$ will be user specified, and the remaining parameters 
will be computed based upon the user specified coverage and failure 
rate parameters. 

We have already seen one of the $\psi_j(t)$ transitions in Example 7 
where we named it a'(t). The next example illustrates a case with 
non-zero $\theta_{ij}(t)$ transitions. 

Example 10: Consider a special case of the (permanent fault) 
reliability model of the Fault-Tolerant Multiprocessor 
(FTMP). Assume that there are n processors, each with a 
constant failure rate $\lambda$. Upon occurrence of a fault there 
is an exponentially distributed detection latency of rate $\delta$. A 
fault is ultimately detected with probability 1, but if a 
second fault occurs while another is latent (within its 
detection latency phase), a coverage failure is said to have 
occurred. Figure 13 shows a portion of this reliability 
model. 



Figure 13. - Abbreviated Fault-Tolerant Multiprocessor Model. 
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Figure 14. - Perfect Coverage Markov Chain. 


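First we solve for the state probabilities $P_j^*(t)$ 
assuming perfect coverage, using the Markov chain in Figure 14: 

      $P_0^*(t) = e^{-n\lambda t}$, 

      $P_1^*(t) = \int_0^t P_0^*(x)\,n\lambda\,e^{-(n-1)\lambda(t-x)}\,dx = n\,e^{-(n-1)\lambda t}\,(1 - e^{-\lambda t})$, 

      $P_2^*(t) = \int_0^t P_1^*(x)\,(n-1)\lambda\,e^{-(n-2)\lambda(t-x)}\,dx = \binom{n}{2}\,e^{-(n-2)\lambda t}\,(1 - e^{-\lambda t})^2$, 

and, in general, 

(18)  $P_j^*(t) = \binom{n}{j}\,e^{-(n-j)\lambda t}\,(1 - e^{-\lambda t})^j$, $\quad j = 0, 1, \ldots, n$. 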
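The binomial form of the perfect-coverage probabilities can be checked against the convolution recursion that produced it. The sketch below assumes n identical units with constant failure rate; the function names and the use of Simpson's rule are illustrative choices rather than anything prescribed by CARE III.

```python
import math


def perfect_coverage_probability(n, j, lam, t):
    """Closed form: probability that exactly j of n units have failed by t."""
    survive = math.exp(-lam * t)
    return math.comb(n, j) * survive ** (n - j) * (1.0 - survive) ** j


def next_state_by_convolution(n, j, lam, t, steps=2000):
    """P_j*(t) obtained from P_{j-1}*(x) by the convolution integral."""
    if j < 1 or steps % 2:
        raise ValueError("need j >= 1 and an even number of steps")
    h = t / steps

    def integrand(x):
        # entry into state j at time x, then no further failure until t
        return (perfect_coverage_probability(n, j - 1, lam, x)
                * (n - j + 1) * lam * math.exp(-(n - j) * lam * (t - x)))

    total = integrand(0.0) + integrand(t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0


# Hypothetical values: four units, failure rate 0.3 per unit time, t = 2.
total_probability = sum(
    perfect_coverage_probability(4, j, 0.3, 2.0) for j in range(5))
```

The closed form sums to one over j, and each state probability agrees with the integral of its predecessor to within quadrature error.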


Next consider the reduced version of Figure 13 shown in 
Figure 15. If we wish to compute $Q_j(t)$ using Formula (17), then 
we first need to derive an expression for $\theta_{j-1,j}(t)$. Now it is 
easy to see that 

(19)  $\theta_{j-1,j}(t) = \frac{(n-j+1)\lambda\,P_{(j-1)A}(t)}{P_{(j-1)A}(t) + P_{(j-1)D}(t)}$. 

In order to apply this formula, we need to obtain the state 
probabilities $P_{(j-1)A}(t)$ and $P_{(j-1)D}(t)$ of the chain of Figure 
13, but this is precisely what we wanted to avoid! 

We note, however, that we do not need the actual values 
of these probabilities but merely the ratio 

      $\gamma = \frac{P_{(j-1)A}(t)}{P_{(j-1)A}(t) + P_{(j-1)D}(t)}$. 

Note further that $\gamma$ is the conditional probability that there is 
a latent fault given that the system has experienced (j-1) 
faults. 


Figure 16. - Example Markov Chain with Coverage. 


We claim that a good approximation to this probability can be 
obtained from a very simple 3-state coverage model for the 
specific module which has experienced the fault that forced the 
system into state (j-1)A. Consider the Markov chain of Figure 
16. We claim that if we compute the state probabilities for this 
chain, then the ratio 

      $(j-1)\,\frac{P(\text{state A at time } t)}{P(\text{state A at time } t) + P(\text{state D at time } t)}$ 

is approximately equal to the ratio $\gamma$ that we seek! The factor 
(j-1) here represents the number of ways in which we could find 1 
latent fault among the j-1 faults present in the system, that is, 
$\binom{j-1}{1}$. In a more general setting we would need the number of 
ways p latent faults can be found in the system given that it has 
experienced j faults, that is, $\binom{j}{p}$ as the multiplying factor. 

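We now proceed with the computation: 

      $P(\text{state A at time } t) = \int_0^t P(\text{cov. model entered during } (x, x+dx)) \cdot P(\text{fault still latent at time } t)$ 

      $\qquad = \int_0^t \lambda e^{-\lambda x}\,e^{-\delta(t-x)}\,dx = \frac{\lambda}{\delta-\lambda}\,(e^{-\lambda t} - e^{-\delta t})$, 

and similarly 

      $P(\text{state D at time } t) = \int_0^t P(\text{state A at time } x)\,\delta\,dx$ 

      $\qquad = \frac{\lambda\delta}{\delta-\lambda} \left[ \frac{1 - e^{-\lambda t}}{\lambda} - \frac{1 - e^{-\delta t}}{\delta} \right] = 1 - \frac{\delta e^{-\lambda t} - \lambda e^{-\delta t}}{\delta-\lambda}$. 

Hence, the required ratio is 

      $(j-1)\,\frac{\lambda}{\delta-\lambda}\,\frac{e^{-\lambda t} - e^{-\delta t}}{1 - e^{-\lambda t}}$ 

which implies that 

(20)  $\theta_{j-1,j}(t) \approx (n-j+1)\lambda\,(j-1)\,\frac{\lambda}{\delta-\lambda}\,\frac{e^{-\lambda t} - e^{-\delta t}}{1 - e^{-\lambda t}}$. 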
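The closed forms used for the latent-fault ratio can be cross-checked numerically. The sketch below assumes the 3-state coverage model of Figure 16 is the chain "fault-free, then A at rate λ, then D at rate δ" (this reading of the figure, the function names and the step count are assumptions); it compares the closed-form probability of state A with a step-by-step integration of the corresponding differential equations.

```python
import math


def coverage_model_probabilities(lam, delta, t):
    """Closed-form P(A) and P(D) for: fault-free --lam--> A --delta--> D."""
    p_a = lam / (delta - lam) * (math.exp(-lam * t) - math.exp(-delta * t))
    p_d = 1.0 - math.exp(-lam * t) - p_a
    return p_a, p_d


def theta_approx(n, j, lam, delta, t):
    """Approximate rate of immediate coverage failure on the j-th fault;
    requires t > 0 and delta != lam."""
    p_a, p_d = coverage_model_probabilities(lam, delta, t)
    return (n - j + 1) * lam * (j - 1) * p_a / (p_a + p_d)


def p_a_by_euler(lam, delta, t, steps=200000):
    """Independent check: integrate dP_0/dt = -lam*P_0 and
    dP_A/dt = lam*P_0 - delta*P_A with forward Euler steps."""
    h = t / steps
    p0, p_a = 1.0, 0.0
    for _ in range(steps):
        p0, p_a = p0 - h * lam * p0, p_a + h * (lam * p0 - delta * p_a)
    return p_a


# Hypothetical rates: failures at 0.01 and detection at 2.0 per unit time.
p_a, p_d = coverage_model_probabilities(0.01, 2.0, 5.0)
```

The two probabilities sum to $1 - e^{-\lambda t}$, the probability that the module has failed at all, and for j = 1 the approximate rate is zero because no earlier fault can be latent.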

The reader is urged to check that computation of $\theta_{12}(t)$ based on 
Formula (19) gives exactly the same answer as (20), whereas the 
exact computation of $\theta_{23}(t)$ gives a somewhat different answer 
from that produced by (20). However, for small enough values of 
$\lambda/\delta$, the two answers tend to be rather close; for example, if we 
take X = 10”^ failures/hour and detection rate 

6 = 10 (X/S = 10 ') we find that the two values, 
©22 (t) decimal places, for any time 

t > 0. 


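Now applying Formula (20), we have 

      $Q_j(t) \approx \int_0^t Q_{j-1}(x)\,(n-j+1)\lambda\,e^{-(n-j)\lambda(t-x)}\,dx$ 
      $\qquad + \int_0^t P_{j-1}^*(x)\,(n-j+1)\lambda\,(j-1)\,\frac{\lambda}{\delta-\lambda}\,\frac{e^{-\lambda x} - e^{-\delta x}}{1 - e^{-\lambda x}}\,e^{-(n-j)\lambda(t-x)}\,dx$, 

where $Q_0(t) = 0$ and $P_j^*(t)$ is given in (18). 
Then using (18), we have 

      $Q_j(t) = (n-j+1)\lambda \Big[ \int_0^t Q_{j-1}(x)\,e^{-(n-j)\lambda(t-x)}\,dx$ 
      $\qquad + \int_0^t \binom{n}{j-1}\,e^{-(n-j+1)\lambda x}\,(1 - e^{-\lambda x})^{j-1}\,(j-1)\,\frac{\lambda}{\delta-\lambda}\,\frac{e^{-\lambda x} - e^{-\delta x}}{1 - e^{-\lambda x}}\,e^{-(n-j)\lambda(t-x)}\,dx \Big]$ 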

= (n-j+l)X [| Q._^M dx 

n(n-l) 1 (1-e"^^) . 


Example 11: We can observe another transition of the $\psi_j$ type when we 
consider the 2-unit system of Figure 17, which incorporates 
the full CARE III single fault coverage model. Errors are 
generated at rate $\rho$ only from the active state, but, once 
generated, can propagate from either active or benign 
states. 

Figure 17. - Two Unit System Model with Full CARE III Single 
Fault Coverage Model. 

The corresponding perfect coverage model was given 
in Example 2, and, if $\alpha = \beta = 0$, we are here carrying out 
the analysis begun in Example 8. 

To avoid the unpleasantness of recursive formulas, 
we consider the case $\beta = 0$ (though $\alpha$ need not be 0). Now 
clearly 


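      $\psi_1(t) = \frac{(1-q)\,\varepsilon\,[P(\text{state AE at time } t) + P(\text{state BE at time } t)]}{P(\text{state A or B or AE or BE or AD or BD at time } t)}$ 

      $\qquad = \frac{(1-q)\,\varepsilon \int_0^t [P_{AE}(t-x) + P_{BE}(t-x)]\,\lambda e^{-\lambda x}\,dx}{\int_0^t (1 - P_F(t-x))\,\lambda e^{-\lambda x}\,dx}$ 

where again $P_j(t-x)$ = P(coverage model state j at time t, 
given entry at time x). Using convolutions, we have 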


P+6-G 


je-(o(+e)r _ ^-(c(+p+S)r 


T 

p+s-e It® e ) ( ;^+p+S-e 


-) ] 


so that 




Pae^^ "*’Pbe 



Further , 


r 

{i-q)6[ p^T3 (y)+P3E (y) l <3y 


_ llz9LL_ee-®’’ + (I-q)€p 

c(+p+o c(+p+o-€ (c(+p+o-S) (c(+p+6) 


e r 


so 


# 




(l-q)ep K e~^^ (l-q)gpXe~^^ (1-g) SpXe~ 

(X-6) (X- W+5+^ W+P+M) (X-6) (c(+p+6-e) (X- {di+p+6) ) 


c(+6+pq 

p+p+S 




Xp(l -q)e ^ , Xfp(q- l) e~<^'^P'^^^*^ .. 

(c(+p+ 5 ^g) (X- 6 )^ ^ (c(+p+ 5 ) (c(+p+S- 6 ) (X- W+p+STT^ 


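Now $Q_0(t) = 0$, and $Q_1(t)$ may be approximated by 

      $Q_1(t) \approx \int_0^t P_1^*(x)\,\psi_1(x)\,e^{-\int_x^t \lambda_1(\tau)\,d\tau}\,dx = 2e^{-\lambda t} \int_0^t (1 - e^{-\lambda x})\,\psi_1(x)\,dx$. 

Since the rate $\lambda$ is, in practice, several orders of 
magnitude smaller than any rate $\gamma \in \{\alpha, \delta, \rho, \varepsilon\}$, it is 
reasonable to consider $Q_1(t)$ as $\lambda/\gamma \to 0$; in this case the 
integral simplifies considerably: 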


Ql(t) 


^i%-f 2 - e-2>''=) 
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I 


This same consideration ($\lambda/\gamma \to 0$) can be used here 
to gain an estimate of the error introduced by using $P_j^*(t)$ 
rather than $P_j(t)$ in computing $Q_j(t)$. If we write 
$\psi_1(t) = N(t)/D(t)$, then by solving the complete Markov 
model of this example (with no separation of coverage) we 
find 

      $P_1(t) = 2e^{-\lambda t}\,D(t)$ 

so that $\psi_1(t) = 2e^{-\lambda t}\,N(t)/P_1(t)$, thus giving us a 
fortuitous cancellation in 

      $Q_1(t) = \int_0^t P_1(x)\,\psi_1(x)\,e^{-\lambda(t-x)}\,dx$. 

If X/y — > 0 now we obtain 


which can easily be compared with the earlier approximate 


value to show the extent of the under-estimation of system 

* 


reliability introduced by using P 

‘ Qj^(t)-0j^(t) ■ 




1 * 


Note that unless 


IS 


small compared to 1 (equivalently, ^5+p~q 


is small compared to 1) , the error is substantial. 


# 

Example 12: Transitions of all four types can 
be illustrated in the 3-unit model of Figure 18, which 
incorporates the CARE III double-fault coverage model, as 
well as two CARE III single-fault models. 





Figure 18. - Three Unit System Model with CARE III Double Fault 
Coverage Model and Single Fault Models. 



The occurrence of a second fault while the first is 
in an active state ($A_1$ or $AE_1$) causes immediate system 
failure, thus a $\theta_{12}$ transition. Should the first fault 
be in a benign state, the second forces entry into the 
double fault model, from which we have detection or system 
failure, the latter due to either both faults becoming 
active or a single active fault generating an error. Thus 
system failures from the double fault model are of the $\psi_2$ 
type, as are uncovered propagated errors from the states 
$AE_2$ and $BE_2$. Of course, we still have $\psi_1$ type transitions 
from states AEj^ and BE^ as well as the obvious and 

types present in Example 11. The corresponding perfect 
coverage model of Figure 19 is easily seen to have solution 


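      $P_0^*(t) = e^{-3\lambda t}$, 

      $P_1^*(t) = 3\,(e^{-2\lambda t} - e^{-3\lambda t})$, 

      $P_2^*(t) = 3\,(e^{-\lambda t} - 2e^{-2\lambda t} + e^{-3\lambda t})$, 

and 

      $P_3^*(t) = 1 - e^{-3\lambda t} + 3e^{-2\lambda t} - 3e^{-\lambda t}$. 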

Figure 19. - Perfect Coverage Model for Figure 18 System. 



Again $Q_0(t) = 0$, and 

      $Q_1(t) \approx \int_0^t P_1^*(x)\,\psi_1(x)\,e^{-\int_x^t \lambda_1(\tau)\,d\tau}\,dx$ 

where $\psi_1$ is computed using a single-fault coverage model 
identical to that in Example 11, and $\lambda_1(\tau) = 2\lambda$. Thus 

      $Q_1(t) \approx 3e^{-2\lambda t} \int_0^t (1 - e^{-\lambda x})\,\psi_1(x)\,dx$. 

If we again consider $\lambda/\gamma \to 0$, then, using the computation of 
this integral already carried out in Example 11, we have 


Ol(t) (e-2Xt.^-3Xt) _ 

Here, again, if we use rather than P* we obtain 

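Now 

      $Q_2(t) \approx \int_0^t Q_1(x)\,\lambda_{12}(x)\,e^{-\int_x^t \lambda_2(\tau)\,d\tau}\,dx$ 
      $\qquad + \int_0^t P_1^*(x)\,\theta_{12}(x)\,e^{-\int_x^t \lambda_2(\tau)\,d\tau}\,dx$ 
      $\qquad + \int_0^t P_2^*(x)\,\psi_2(x)\,e^{-\int_x^t \lambda_2(\tau)\,d\tau}\,dx$. 

Clearly, $\lambda_{12} = 2\lambda$, $\lambda_2 = \lambda$, and 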


^2X [P (state Aj^ at time t)+P(state AE, at time t) ] 

ejj(t) _ 

which is again easily computed as in Example 11, using the 
single-fault model in isolation. Finally, as mentioned 
earlier, $\psi_2$ has two components, one from the double-fault 
model; 

(p+p) [P (state AB)+P (state BA) 1 
* P2(t) 

and one from the second single-fault model; 

(l-q)e [P(state AE 2 )+P(state BE 2 ) 1 

-pJTt) 

Note, however, that this "second single-fault model" is 
actually an integral part of the "double-fault model", and 
it is this joint coverage model which must be considered 
when calculating these last ratios in the style of Example 
11 . 

If we again restrict ourselves to $\beta = 0$ and consider 
$\lambda/\gamma \to 0$, then we obtain 

2 W+d+p) 


4.3 Coverage Model Calculations 

In this section, we consider general methods of deriving the 
transition parameters $\psi_j(t)$ and $\theta_{ij}(t)$. It should be noted from our 
previous examples that since we only compute $Q_j(t)$ and $P_j^*(t)$, we never 
need the parameters )^^j(t). 
First we consider the parameter $\psi_j(t)$. This parameter arises 
from those latent faults that give rise to a coverage failure in 
absence of any additional faults. It is certainly possible, in 
general, for a single latent fault by itself to give rise to a 
coverage failure. This situation will be captured by a general single 
fault model that we will discuss. It is also possible for a 
combination of interacting (non-independent) latent faults to give 
rise to a coverage failure. To capture such a situation, we have to 
consider all possible system states due to such an interacting set of 
latent faults. In order to avoid such complexity, we only consider 
all pairs of interacting latent faults (in a general double-fault 
model) and assume that a third (interacting) fault will immediately 
give rise to a coverage failure. 

Unlike all the examples we have considered earlier, the two 
coverage models we consider here are semi-Markov processes. The main 
references on this topic are [23, 24]. A semi-Markov process shares 
with a Markov process the property that state transitions are 
regeneration points obliterating the influence of the past. However, 
the holding time in a state is no longer assumed to be exponentially 
distributed. Thus we will model the time dependency of transition 
parameters where the (local) time is measured from the entry into the 
specific state. This is in contrast to the non-homogeneous Markov 
chain, where only time dependency of transition parameters with respect 
to global time was allowed. 

Consider a general semi-Markov process shown in Figure 20. 
Events that cause a transition from state i to state j occur at the 
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Figure 20. - General Semi-Markov Process 


rate $h_{ij}(t)$, independently of the events causing other transitions. 
Let 

      $p_{ij}(t) = h_{ij}(t)\,r_{ij}(t)$, where $r_{ij}(t) = e^{-\int_0^t h_{ij}(\tau)\,d\tau}$. 

Unlike the Markov case, we prefer to label the arcs by the 
corresponding pdf's ($p_{ij}(t)$) rather than by transition rates. Thus in 
this notation, an arc labelled $\alpha$ in a Markov chain will now be 
labelled $\alpha e^{-\alpha t}$. Let $F_{ij}(t)$ be the (unconditional) probability that a 
transition from state i to state j occurs in (local) time with 
duration $\le t$. Then 

      $F_{ij}(t) = \int_0^t p_{ij}(x) \prod_{k \ne j} r_{ik}(x)\,dx$, where $r_{ik}(t) = 1 - \int_0^t p_{ik}(\tau)\,d\tau$. 

Let $f_{ij}(t)$ be the derivative of $F_{ij}(t)$. Note that $r_{ik}(t)$ is the 
conditional probability that no transition be made to state k by time 
t. The holding time distribution in state i is then given by 

      $F_i(t) = \sum_j F_{ij}(t)$. 
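The competing-arc construction can be illustrated numerically. The sketch below assumes each arc out of state i is described by its own hazard function of local time and computes the probability that a chosen arc fires first, and by time t, as the integral of that arc's hazard times the survival functions of all arcs. The dictionary layout, function names and trapezoidal rule are illustrative assumptions.

```python
import math


def survival(hazard, t, steps=400):
    """exp(-integral_0^t hazard(u) du), by the trapezoidal rule."""
    if t == 0.0:
        return 1.0
    h = t / steps
    area = 0.5 * (hazard(0.0) + hazard(t)) + sum(
        hazard(k * h) for k in range(1, steps))
    return math.exp(-area * h)


def transition_probability(hazards, j, t, steps=400):
    """Probability that the arc to state j fires first, and by local time t.

    hazards maps each destination state k to its hazard function.
    """
    h = t / steps

    def density(x):
        # hazard of arc j times the survival functions of all arcs
        # (arc j's own survival turns its hazard into its pdf)
        value = hazards[j](x)
        for fn in hazards.values():
            value *= survival(fn, x, steps)
        return value

    area = 0.5 * (density(0.0) + density(t)) + sum(
        density(k * h) for k in range(1, steps))
    return area * h


# Two exponential arcs with hypothetical rates 2 and 3 (the Markov case).
arcs = {"j": lambda x: 2.0, "k": lambda x: 3.0}
f_j = transition_probability(arcs, "j", 1.0)
f_k = transition_probability(arcs, "k", 1.0)
```

In the exponential case the result reduces to the familiar split of the holding-time distribution in proportion to the rates, which serves as a check; non-exponential arcs are handled by supplying time-varying hazard functions.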

Feller [23] shows that 

(21)  $P_{ik}(t) = \delta_{ik}\,(1 - F_i(t)) + \sum_j \int_0^t f_{ij}(x)\,P_{jk}(t-x)\,dx$ 

where $\delta_{ik}$ is the Kronecker delta function. It should be noted that in 
the Markovian cases we used the forward Equation (1) all along. In 
the semi-Markov case the forward equation is much harder than the 
backward equation above. 

Example 13: Consider the single-fault model shown in Figure 21. 
Applying Equation (21) to the present problem and 
remembering that state 0 is the initial state, we get 

# 

= aft) r(t) a(t) + I d(x)r(x) Pgj^(t-x) dx 

Pba<‘> = I P 

t t “X 

p^^(t)=d(t) r (t)a(t)+p fcCe"^^ d(x)r(x) e"P’'p^^(t-r-x)dr dx 

J t 

[c(e"^^d(x) r (X) ^ e“P p^^(y) dy dx] 

=e"^^ d(t)r{t)+p| |.^c(e"^^ d (x) r (x) e“P p^^{y) dx dy. 

Let <l>(t-y) = ^ ^ e“^^d(x) r(x)e”P^^”^ dx 
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and where 


<3(t) = 1-| 6(r)d r , 

a(t) = 1 - I dr = , 

r(t) = 1 - ^ c((r)d T 
Hence 



Figure 21, - CARE III Single-Fault Model 
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Similarly, we have. 


AB 


(t) = <l)(t) + p |t(t-y)P;^B(y) <3y . 


f these 

probabilit 

ies are 

computed 


t 

k(r)dr 


PA(t) = 

CX(x) e ^ 




t ~ 

fX(’’)dr 


PB(t) = 

S X(x) e 


Pa 3 (t-x) 


, we can compute 
dx 

dx . 


Finally, the contribution to $\psi_j(t)$ by the single fault 
model, denoted by a'(t), is given by: 

_ P'p(t) 

^A (t) Pfi 
D D ^ 

# 

A similar development can be given for the double fault model; 
however, we omit the details here. 


5. Convolution Approximations 

Numerous computations of convolution integrals of the form 
$\int_0^t f(t-\tau)\,g(\tau)\,d\tau$ are required in CARE III reliability estimation. 
Since one of the functions, say g, is typically from a coverage model, 
while the other, f, is from the fault model, one can exploit the fact 
that f will vary slowly (relative to g over the interval of interest) 
to obtain an easily-computed approximation to the convolution 
integral. Specifically, $f(t-\tau)$ can be replaced by the quadratic 
interpolation polynomial in $\tau$ which passes through the points 
$(0, f(t))$, $(t/2, f(t/2))$, and $(t, f(0))$. Writing this polynomial as 
$a(t) + b(t)\,\tau + c(t)\,\tau^2$, one obtains 

      $\int_0^t f(t-\tau)\,g(\tau)\,d\tau \approx a(t) \int_0^t g(\tau)\,d\tau + b(t) \int_0^t \tau\,g(\tau)\,d\tau + c(t) \int_0^t \tau^2 g(\tau)\,d\tau$. 
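The quadratic-interpolation idea can be sketched in a few lines. The coefficient formulas below follow from requiring the polynomial to pass through the three stated points; the function names, the use of Simpson's rule for the three moments of g, and the test functions are assumptions made for illustration and do not reproduce the CARE III code.

```python
import math


def simpson(func, lower, upper, steps=2000):
    """Composite Simpson's rule; steps must be even."""
    if steps % 2:
        raise ValueError("steps must be even")
    h = (upper - lower) / steps
    total = func(lower) + func(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * func(lower + k * h)
    return total * h / 3.0


def quadratic_coefficients(f, t):
    """Coefficients of a + b*tau + c*tau**2 matching f(t - tau) at
    tau = 0, t/2 and t."""
    f_end, f_mid, f_start = f(t), f(t / 2.0), f(0.0)
    a = f_end
    b = (-3.0 * f_end + 4.0 * f_mid - f_start) / t
    c = (2.0 * f_end - 4.0 * f_mid + 2.0 * f_start) / (t * t)
    return a, b, c


def approximate_convolution(f, g, t):
    """Convolution of f and g on [0, t] using only three moments of g."""
    a, b, c = quadratic_coefficients(f, t)
    m0 = simpson(g, 0.0, t)
    m1 = simpson(lambda tau: tau * g(tau), 0.0, t)
    m2 = simpson(lambda tau: tau * tau * g(tau), 0.0, t)
    return a * m0 + b * m1 + c * m2


def direct_convolution(f, g, t):
    """Reference value by direct quadrature of f(t - tau) * g(tau)."""
    return simpson(lambda tau: f(t - tau) * g(tau), 0.0, t)


# Hypothetical slowly varying fault-model function and fast coverage function.
slow = lambda u: math.exp(-1e-4 * u)
fast = lambda tau: 5.0 * math.exp(-5.0 * tau)
approx = approximate_convolution(slow, fast, 10.0)
exact = direct_convolution(slow, fast, 10.0)
```

The attraction of the scheme is that the three moments of g depend only on the coverage model, so they can be reused for every fault-model function f; the approximation is exact whenever f is itself a quadratic.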


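As an example, consider again the computation of $\psi_1$ in Example 
11 of section 4.2 (this is $a'(t|1)$ in CARE III notation). In the 
first formulation of the CARE III model we see 

      $a'(t|1) = \frac{\int_0^t p'_F(t-\tau)\,r(\tau)\,\lambda(\tau)\,d\tau}{1 - r(t)}$ 

where $p'_F(t-\tau) = (1-q)\,\varepsilon\,[P_{AE}(t-\tau) + P_{BE}(t-\tau)]$, $r(\tau) = e^{-\lambda\tau}$, and 
$\lambda(\tau) = \lambda$. In the later formulation we see 

      $a'(t|1) = \frac{h_F(t)}{1 - r(t)}$ 

where $h_F(t)$ is defined as $a_F(t)\,m_F(t) + b_F(t)\,m'_F(t) + c_F(t)\,m''_F(t)$. 
But of course 

      $\int_0^t p'_F(t-\tau)\,r(\tau)\,\lambda(\tau)\,d\tau = \int_0^t p'_F(\tau)\,r(t-\tau)\,\lambda(t-\tau)\,d\tau$, 

so $r(t-\tau)\,\lambda(t-\tau)$ is the fault model function f, $p'_F$ is the coverage model 
function g, and 

      $m_F(t) = \int_0^t p'_F(\tau)\,d\tau$, $\quad m'_F(t) = \int_0^t \tau\,p'_F(\tau)\,d\tau$, $\quad m''_F(t) = \int_0^t \tau^2 p'_F(\tau)\,d\tau$. 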


Finally, the careful reader has perhaps noticed a difference 
in denominator between the expressions for $a'(t|1)$ above and those 
given in Example 11 of section 4. This difference is introduced to 
compensate for the $P^*$ substitution; again, if we write 
$\psi_1 = a' = N/D$, then 

      $P_1\,\psi_1 = P_1\,(2e^{-\lambda t}N / 2e^{-\lambda t}D) = P_1\,(2e^{-\lambda t}N / P_1) = 2e^{-\lambda t}N$. 

If we plan to substitute $P_1^*$ for $P_1$ in $P_1(N/D)$, then an exact 
compensation occurs if we also substitute $P_1^* / 2e^{-\lambda t}$ for D; but 
$P_1^* / 2e^{-\lambda t} = 1 - e^{-\lambda t} = 1 - r(t)$, as above. 

6.0 Concluding Remarks 

CARE III is an advanced reliability prediction tool developed 
by Raytheon under the sponsorship of NASA Langley Research Center. 
Because of the sophisticated mathematics employed by CARE III, it was 
deemed desirable to provide an independent view and a tutorial of 
the various important concepts employed. As of this writing, details of 
CARE III are evolving, and therefore, no attempt has been made to 
track its developments in complete detail. Most of the concepts 
outlined here remain valid in spite of the later changes to CARE III. 

The major notions used in CARE III are those of behavioral 
decomposition followed by aggregation in an attempt to deal with 
reliability models with a large number of states. A comprehensive set 
of models of the fault-handling processes in a typical fault-tolerant 
system has been used. These models are semi-Markov in nature, thus 
removing the usual restriction of exponential holding times within 
the coverage model. The aggregate model is a non-homogeneous Markov 
chain, thus allowing the times to failure to possess Weibull-like 
distributions. Because of the departures from traditional models, the 
solution method employed is that of Kolmogorov integral equations, 
which are evaluated numerically. 

There are several sources of errors in the CARE III model. 
First, the decomposition/aggregation process involves the error in 
estimating the transition parameters such as $\theta_{ij}(t)$ on the basis of 
the analysis of a single module rather than the entire system. 
Second, the substitution of $P_j^*(t)$ in place of $P_j(t)$ in solving for 
$Q_j(t)$ introduces errors. Similarly, the states are treated as 
terminal states in the actual CARE III model (refer to Examples 2 and 
3), which introduces errors. It is recommended that a theoretical 
analysis of these errors be carried out and bounds on these errors be 
obtained. Experimental analysis of these errors is also desirable. 

Yet another source of errors is numerical in nature. The 
numerical integration carried out to obtain $Q_j(t)$ involves 
discretization and round-off errors. The convolution integration in 
solving for coverage models contains truncation errors. These errors 
also need to be analyzed. 
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